Evaluating Inter-Bilingual Semantic Parsing for Indian Languages
Despite significant progress in Natural Language Generation for Indian languages (IndicNLP), there is a lack of datasets around complex structured tasks such as semantic parsing. One reason for this gap is the complexity of the logical form, which makes English-to-multilingual translation difficult: the logical forms, intents, and slots must be aligned with the translated unstructured utterance. To address this, we propose IE-SEMPARSE, an inter-bilingual seq2seq semantic parsing dataset for 11 distinct Indian languages. We highlight the proposed task's practicality and evaluate existing multilingual seq2seq models across several train-test strategies. Our experiments reveal a high correlation between model performance on existing multilingual semantic parsing datasets (such as mTOP, Multilingual TOP and multiATIS++) and on our proposed IE-SEMPARSE suite.
Comment: 21 pages, 9 figures, 15 tables
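
To make the alignment step concrete, the sketch below shows one way slot values in a TOP-style English logical form could be projected onto a translated utterance using a phrase-alignment map. The logical form, slot names, and the Hindi translation are illustrative placeholders, not examples drawn from IE-SEMPARSE.

# Hypothetical sketch: projecting slot values from an English TOP-style logical
# form onto a translated utterance. The intent/slot labels and the
# phrase-alignment map below are illustrative, not taken from IE-SEMPARSE.
import re

def project_slots(logical_form: str, phrase_map: dict[str, str]) -> str:
    """Replace each slot value with its aligned translation, keeping intents and slots."""
    def replace(match: re.Match) -> str:
        slot, value = match.group(1), match.group(2).strip()
        translated = phrase_map.get(value, value)  # fall back to the source phrase
        return f"[SL:{slot} {translated} ]"
    return re.sub(r"\[SL:(\w+) ([^\[\]]+?)\]", replace, logical_form)

english_lf = "[IN:CREATE_ALARM [SL:DATE_TIME tomorrow at 7 am ] ]"
# Alignment map as produced by e.g. an automatic phrase aligner (assumed input).
alignment = {"tomorrow at 7 am": "कल सुबह 7 बजे"}
print(project_slots(english_lf, alignment))
# [IN:CREATE_ALARM [SL:DATE_TIME कल सुबह 7 बजे ] ]
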
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages
We create publicly available language identification (LID) datasets and
models in all 22 Indian languages listed in the Indian constitution in both
native-script and romanized text. First, we create Bhasha-Abhijnaanam, a
language identification test set for native-script as well as romanized text
which spans all 22 Indic languages. We also train IndicLID, a language
identifier for all the above-mentioned languages in both native and romanized
script. For native-script text, it has better language coverage than existing LIDs and performs competitively with or better than them. IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized-text LID are the lack of training data and low LID performance when languages are closely related. We provide simple and effective solutions to these problems. In
general, there has been limited work on romanized text in any language, and our
findings are relevant to other languages that need romanized language
identification. Our models are publicly available at
https://ai4bharat.iitm.ac.in/indiclid under open-source licenses. Our training
and test sets are also publicly available at
https://ai4bharat.iitm.ac.in/bhasha-abhijnaanam under open-source licenses.
Comment: Accepted to ACL 2023
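
As a rough illustration of how such an LID model might be used, the snippet below loads a fastText-format classifier and predicts a language label for native-script and romanized inputs. The model file name, label scheme, and example sentences are assumptions; the released IndicLID models and their actual interface are documented at the links above.

# Minimal LID sketch using the fastText library. The model file name and
# label format are hypothetical; consult the IndicLID release at
# https://ai4bharat.iitm.ac.in/indiclid for the actual model files and API.
import fasttext

# Load a (hypothetical) fastText-format Indic LID model.
model = fasttext.load_model("indic_lid.bin")

samples = [
    "मैं कल दिल्ली जा रहा हूँ",      # Hindi, native (Devanagari) script
    "naan innaikku chennai poren",   # Tamil, romanized
]
for text in samples:
    labels, probs = model.predict(text, k=1)  # top-1 prediction
    print(f"{text!r} -> {labels[0]} ({probs[0]:.2f})")
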
CTQScorer: Combining Multiple Features for In-context Example Selection for Machine Translation
Large language models have demonstrated the capability to perform machine translation when prompted with a few examples (in-context learning). Translation quality depends on various features of the selected examples, such as their quality and relevance, but previous work has predominantly focused on individual features in isolation. In this paper, we
propose a general framework for combining different features influencing
example selection. We learn a regression model, CTQ Scorer (Contextual
Translation Quality), that selects examples based on multiple features to maximize translation quality. On multiple language pairs and language models, we show that CTQ Scorer significantly outperforms random selection as well as strong single-factor baselines reported in the literature. We also observe an improvement of over 2.5 COMET points on average over a strong BM25 retrieval-based baseline.
Comment: Accepted to Findings of EMNLP 2023
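
The sketch below illustrates the general idea of regression-based example selection: compute several features for each candidate in-context example, predict a contextual translation quality score with a learned regressor, and keep the top-scoring examples. The specific features, the Ridge regressor, and the toy training targets are illustrative assumptions rather than the paper's exact setup.

# Illustrative sketch of feature-based in-context example selection in the
# spirit of CTQ Scorer. The features, the Ridge regressor, and the training
# targets below are assumptions for illustration, not the paper's exact setup.
import numpy as np
from sklearn.linear_model import Ridge

def featurize(query: str, example_src: str, example_tgt: str) -> list[float]:
    """Toy features for a (query, candidate example) pair."""
    q_tokens, s_tokens = set(query.split()), set(example_src.split())
    overlap = len(q_tokens & s_tokens) / max(len(q_tokens), 1)                # lexical overlap (retrieval-score stand-in)
    len_ratio = len(example_tgt.split()) / max(len(example_src.split()), 1)   # target/source length ratio
    src_len = float(len(example_src.split()))                                 # example length
    return [overlap, len_ratio, src_len]

# Train the scorer on (features, downstream translation quality) pairs, e.g.
# quality scores obtained by prompting an LLM with each candidate (hypothetical data).
X_train = np.array([[0.6, 1.1, 12.0], [0.1, 0.9, 30.0], [0.4, 1.0, 8.0]])
y_train = np.array([0.82, 0.55, 0.74])  # hypothetical quality scores
scorer = Ridge(alpha=1.0).fit(X_train, y_train)

# At test time: score every candidate example for a new query and keep the top-k.
query = "the committee approved the annual budget"
pool = [("the board approved the budget", "बोर्ड ने बजट को मंजूरी दी"),
        ("he plays football every sunday", "वह हर रविवार फुटबॉल खेलता है")]
X_pool = np.array([featurize(query, s, t) for s, t in pool])
top_k = np.argsort(scorer.predict(X_pool))[::-1][:1]
print([pool[i] for i in top_k])
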